Adds literal decoding variant with a per-stream LDS cache to coalesce memory writes through transposition. #72

Draft
pm4rtx wants to merge 2 commits into microsoft:development from pm4rtx:unroll-huffman-decode

Conversation

pm4rtx (Collaborator) commented Feb 18, 2026

This PR makes literal decoding more memory friendly by avoiding scattered one-byte-per-thread writes into N destination locations, one per processed stream. Instead, it accumulates four decoded bytes with aligned destination addresses into a dword and then stores the dwords from each processed stream in LDS. When the LDS cache becomes full or the last full dword is formed, the dwords are flushed from LDS to memory cooperatively by the entire threadgroup, producing coalesced writes.
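
The staging scheme described above can be sketched on the CPU. This is a minimal Python simulation, not the PR's shader code: `decode_with_staging`, `DWORDS_PER_STREAM`, and the list standing in for LDS are all hypothetical names, the input bytes stand in for already-decoded literals, and each stream's destination offset is assumed to be 4-byte aligned. Trailing bytes that never fill a dword are written individually here, which is one possible way to handle the tail.

```python
import struct

DWORDS_PER_STREAM = 4  # hypothetical per-stream cache depth, in dwords


def decode_with_staging(streams, offsets, out_size):
    """Simulate one lane per stream writing through a staged dword cache.

    streams: list of bytes objects standing in for decoded literals.
    offsets: 4-byte-aligned destination offset of each stream (assumption).
    """
    out = bytearray(out_size)
    n = len(streams)
    lds = [[0] * DWORDS_PER_STREAM for _ in range(n)]  # stand-in for LDS
    filled = [0] * n     # dwords currently cached per stream
    flushed = [0] * n    # dwords already written to memory per stream
    acc = [0] * n        # partially assembled dword per stream
    acc_bytes = [0] * n  # number of bytes held in acc

    def flush():
        # In a shader the whole threadgroup would walk the cache so that
        # neighbouring threads store neighbouring dwords of one stream;
        # this nested loop visits the dwords in that same order.
        for s in range(n):
            for d in range(filled[s]):
                dst = offsets[s] + 4 * (flushed[s] + d)
                struct.pack_into("<I", out, dst, lds[s][d])
            flushed[s] += filled[s]
            filled[s] = 0

    longest = max(len(s) for s in streams)
    for i in range(longest):      # one decode step across all lanes
        for s in range(n):        # each lane owns one stream
            if i >= len(streams[s]):
                continue
            acc[s] |= streams[s][i] << (8 * acc_bytes[s])  # little-endian
            acc_bytes[s] += 1
            if acc_bytes[s] == 4:  # dword complete: stage it in the cache
                lds[s][filled[s]] = acc[s]
                filled[s] += 1
                acc[s] = 0
                acc_bytes[s] = 0
        if any(f == DWORDS_PER_STREAM for f in filled):
            flush()               # some stream's cache is full
    flush()                       # write out the last full dwords

    # Leftover bytes that never formed a full dword are stored one by one.
    for s in range(n):
        base = offsets[s] + 4 * flushed[s]
        for b in range(acc_bytes[s]):
            out[base + b] = (acc[s] >> (8 * b)) & 0xFF
    return bytes(out)
```

With one stream per lane, the per-byte scattered stores are replaced by dword stores to consecutive addresses within each stream, which is the access pattern a GPU can coalesce.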

This new variant of the shader also halves the LDS used to store the Huffman table (from 2048 to 1024 dwords). This is still above the ideal (768 dwords), but it is an improvement and frees some LDS space for the per-stream data cache.

pm4rtx self-assigned this Feb 18, 2026

coopp (Collaborator) commented Feb 18, 2026

Getting back LDS space is goodness all around. Looks great to me.

